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ABSTRACT 

This paper has two objectives. The first objective is to formulate a 3-dimensional Finite 
Element Model for the dynamic analysis of helicopter rotor blades. The second objective is 
to implement and analyze a dual-primal iterative substructuring based Krylov solver, that is 
parallel and scalable, for the solution of the 3-D FEM analysis. The numerical and parallel 
scalability of the solver is studied using two prototype problems - one for ideal hover (sym- 
metric) and one for a transient forward flight (non-symmetric) - both carried out on up to 48 
processors. In both hover and forward flight conditions, a perfect linear speed-up is observed, 
for a given problem size, up to the point of substructure optimality. Substructure optimality 
and the linear parallel speed-up range are both shown to depend on the problem size as well 
as on the selection of the coarse problem. With a larger problem size, linear speed-up is 
restored up to the new' subtructure optimality. The solver also scales with problem size - 
even though this conclusion is premature given the small prototype grids considered in this 
study. 

The word subdtrue.f, uring and the method was intro- 
duced fifty years ago by Przemieniecki [1, 2j . Together 
with Dense [3], Argyrts and Kelsey (4), arid Turner et 
al [5], they laid the foundations of displacement and force 
finite element analysis of partitioned structures. These 
partitioned methods were the only avenues to obtain re- 
sults for practical structures for which the original prob- 
lem far exeeded the capacity of computers at the time. 
The method of substructures have remained the fastest 
(time), most efficient; (memory), most reliable (accurate), 
and most flexible (heterogenous physics and properties) 
method of partitioned analysis of large scale structures. 
The modern methods of primal and dual iterative sub- 
structuring have their origin and genesis in these original 
contributions established long before the advent of par- 
allel computers. 

The advent of parallel computers opened opportu- 


INTRODUCTION 

One hundred years ago what is now called the 
Poincare-SUifdov operator was introduced. This oper- 
ator, which governs the interface of a problem generated 
by decomposing a larger problem into many smaller sub- 
problems, have spectral properties that are superior to 
the original problem. In particular, its condition num- 
ber grows at a rate that is an order lower than that of 
the original problem. Every modern method of iterative 
substructuring is based on one or many variational forms 
of this operator. 
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nitv for solving each partitioned substructure in a sepa- 
rate processor. A straight forward implementation, how- 
ever. leads to & dead end. First, the high condition num- 
ber and lack of sparsity of the interface equation by itseif 
poses a significant challenge for convergence. Second, 
the total interface size grows both with problem size and 
with partitions, producing a dramatic increase in the re- 
quired number of iterations. The recognition, that a fi- 
nite element representation of the substructure interface 
is precisely a discrete equivalent of the original Poincare- 
Steklov operator, allows the mathematical theories of do- 
main decomposition to be brought to bear directly to- 
wards the resolution of this key problem. Today, pro- 
cedures axe available which preconditions the interface 
and solves it iteratively, in a parallel and scalable man- 
ner, requiring only local substructure calculations. These 
procedures are called iterative sub-structuring methods. 
Their objective is to provide optimal numerical scala- 
bility, be. to ensure that the condition number of the 
pre-conditioned interface does not grow with the number 
of substructures and grows at mast polylogarithmically 
with mesh refinement within each substructure. 

The mathematical theory of domain decomposition 
provides scalable algorithms for two broad classes of par- 
titioning: overlapping and non-overlapping [6J. The over- 
lapping partitioning leads to additive. Schwarz methods, 
which are valiants of block Jacobi preconditioners. These 
are widely used in fluid mechanics, but are of little or no 
importance in structural mechanics ■- due to the very 
high condition numbers (10* - 10 10 ) and high bandwidth 
of practical structures. Structural mechanics use non- 
overlapping partitioning. They lead to iterative substruc- 
turing methods, a name borrowed from, and as explained 
earlier, indicative of the deep connections to the long and 
successful tradition of substrucfe uring. 

The growth of the mathematical theory of itera- 
tive substructuring can be traced to the seminal work 
of Agoshkov et a), IT, 8] and Bramble et ai. [9| in the 
mid-1980s. The former provided a detailed analysis 
of the Poincare- Steklov operator. The latter provided 
one of the earliest scalable interface preconditioners for 
a 2-nd order elliptic problem with homogenous coeffi- 
cients. Subsequent algorithmic developments in this area 
have continued through the t990s and 2000s (the inter- 
ested reader is referred to monographs by Quarteroni and 
Valli [10] and Tbselli and Widhmd [11]) culminating in 
the increasing application of these methods for High Per- 
formance Computing (HPC) based large scale problems 
of computational mechanics [12, 13]. Today, Neumann- 
Neumann type balancing methods known as Balanc- 
ing Domain Decomposition with Constraints (BDDC) 
(see [14] and references therein) and the Dirichlet- 
Dirichlet type dual methods known as Finite Element 
Tearing and Integration (PBT1) methods (see [15] and 
references therein) provide scalable preconditioners that 
are optimal for up to 4-th order problems that include 
highly discontinuous and heterogenous subdomains. 

Berth the Neumann -Neumann and FETI methods 
act on the discrete equivalents of the Poinairi-SteMm 


interface operator. The former acts on its primal form. 
The latter acts on its dual form (explained later). Both 
are based on simultaneous Dirichlet and Neumann solves 
within each substructure - one for preconditioning and 
one for residual calculation — only in reverse order to 
one another. Note that the Neumann solves which lie 
at the heart of these methods - are ison-invertible for 
floating substructures that arise naturally from multiple 
partitioning of a structure. In Neumann-Neumann, this 
singularity occurs in the preconditioner, in FETI this sin- 
gularity occurs in the residual calculation. 

The primary objective of this paper is to apply an 
advanced multi-level FETI method, I, he FETI-Duaf Pri- 
mal (DP) method, pioneered bv Par hat et al. [15, 16], for 
the parallel solution of a large scale rotary wing struc- 
tural dynamics problem. The FETI- DP method acts 
both on the primal and dual form of the interface - each 
on. a separate portion of the interface. We apply this 
method in the present work because there is a substan- 
tia! volume of published literature demonstrating its high 
level of performance for large scale engineering problems 
(see references above). Note, however, that both the 
FETI-DP and the BDDC methods are doseiv connected, 
and we refer the interested reader to the recent work of 
Mandel et, al [17] for an excellent exposition of this con- 
nection. 

The state-of-the-art in rotary wing structural model- 
ing involves a variational- asymptotic reduction of the 3D 
nonlinear elasticity problem into a 2D linear cross-section 

analysis and a ID geometrically exact beam analysis 

based on Berdichevsky [18] and pioneered by Hodges and 
his coworkers [19]. Aeroelastic, computations are per- 
formed on the beam, followed by a recovery of the 3D 

st ress field. The method is efficient and: accurate except 

near end-edges and discontinuities for which a 3-D analy- 
sis is still needed - as long as the cross-sections are small 
compared to the wavelength of deformations along the 
beam. Modern hingeless and bearingless configurations 
contain 3-D flexible load bearing components near the 
root that- Slave short aspect ratios and cannot lie treated 
as beams. Moreover, treatment of blades, depending on 
their advanced geometry and material anisotropy, also 
requires continuous improvements based on refinements 
to the asymptotic cross-section analysis [20, 21, 22]. 

A second objective of this paper, therefore, is to de- 
velop a 3-D FEM analysis for rotary wing structures that 
can be used to analyze generic, 3-D, non-beam like hubs 
as well as advanced geometry blade shapes. With the 
emergence of rotoreraft CFD, physics-based models con- 
taining millions of grid points cat) carry out RANS com- 
putations using 100’s of cores, routinely, in a research 
environment, for the rotor, and even for the entire heli- 
copter. Current, research is focused today on. coupling be- 
tween CFD and relatively simple engineering-level rotor 
structural dynamics. The purpose of the second objec- 
tive of this paper, therefore, is to explore the possibility 
of integrating 3D FEM as the physics-based model in the 
CSD domain. 

A 3-D FEM analysis will provide enabling technol- 
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ogy for modeling advanced hingeiess and bearingless ro- 
tors. It. will provide enabling technology for calculating 
the stresses and strains directly on the critical load bear- 
ing components near the hub. It will provide a model 
that is a true representation of the 3-D structure, con- 
sistent with the high fidelity sought in large scale CFD 
computations. Thus, this research is primarily targetted 
towards large-scale, high-fidelity, HPC based comprehen- 
sive rotorcraft simulations. However, as a by-product, it 
will still provide a means for extracting 2-D sectional 
properties for generic structures, with which lower fi- 
delity analysis can always be carried out. There is no 
question that such, a capability will be powerful. The 
question is that of an efficient solution procedure. As in 
CFD, the tremendous capabilities of HPC is also envi- 
sioned here to be the key technology driver and enabler. 
The primary objective of this paper is therefore to ad- 
dress this key challenge directly. 


Scope of Present Work 

A 3-D FEM analysis for rotary wing dynamics is de- 
veloped with emphasis on identifying the inertial terms 
that are unique to rotors. The 3-D FEM solver is then 
used to implement, and analyze a parallel Newton- Krylov 
solver developed using the FETI-DP method of iterative 
substructuring. This solver is equipped with a General- 
ized Minimum Residual (GMRES) update, in addition 
to its traditional CG based update, to accomodate the 
nonsymmetric nature of rotary wing dynamics. However 
no formal proof of convergence or condition estimates are 
derived for this nonsymmetric operator. 

Advanced finite element capabilities like locking-free 
elements, hierarchical elements, nonlinear constitutive 
models, composite ply modeling are beyond the scope 
of this initial work. Grid generation is not part of this 
endeavor. Simple grids are constructed that are adequate 
for research purposes. Domain partitioning, on the other 
hand, is a part of this work. Standard graph partition- 
ers that are readily available will not suffice, for reasons 
described herein. Most key elements of a comprehensive 
rotorcraft analysis are not considered at present: airloads 
model, trim model, extraction of periodic dynamics, and 
multi-body dynamics, are all part of future work. 

The paper is organized as follows. The second sec- 
tion describes the formulation of 3-D Finite Element 
Modeling (FEM) for rotors, followed by preliminary ver- 
ification using thin plate and rotating beam results. The 
third section presents a brief description of the iterative 
substructuring algorithm and its parallel implementa- 
tion. The numerical scalability of the algorithm is estab- 
lished in this section. The fourth section introduces the 
key components of the 3D FEM analysis: geometry and 
grids, partitioning and corner selection, the hover pro- 
totype, and the transient forward flight prototype. The 
fifth section, is focussed on scalability. Analyses were per- 
formed on up to 48 processors - the maximum available 
to the authors at present. The paper ends with the key 
conclusions of this work, and a summary of the future 


research directions that are critical to the success of this 
endeavour. 


3D FINITE ELEMENTS FOR ROTORS 

The Finite Element formulation is based on well es- 
tablished, standard procedures [23, 24]. The non-linear, 
geometrically exact implementation, follows incremen- 
tal approach, using Green- Lagrange strain and Second 
Piola-Kirchoff stress measures, within a total Lagrangian 
formulation. The main contribution here is on. identify- 
ing the inertial terms that are unique to rotorcraft. 


Strain energy 

Let the deformed coordinates of a material point in 
the blade at any instant be given by 

Xi — x i + Ui(x®.X2,xf) 

x 2 ~ x% + x%, x % ) ( I) 

*3 — + «3(x 5, a;®.*”) 

where x( and u, , i = 1 , 2, 3 denote the undeformed coor- 
dinates and the 3-D deformation field respectively. The 
deformation gradient of the point with respect to its un- 
deformed configuration is then Xq 




%l t 2 

*1,3 

dxi 

Xq — 

*2,1 

X'2.2 

*2,3 

where ^ l 


. *3,1 

X? r _2 

*3.3 . 



The Green- Lagrange strain relates the deformed length, 
of a line element, dl, to its original, length on the unde- 
formed blade, <tt°, in the following manner 

(.ijdxidx j -- (l/2)[(dl)* - (dl 0 ) 2 ] 

where (dl ) 2 = dxt dxi and (dl 0 ) 2 d XidXf. The Green- 
Lagrange strain tensor follows 

e = (l/2)(<7- J) 


where G is the left Cauchy-Green deformation tensor 
given, bv 

c=x£x 0 

The elements of the Green-Lagrange strain tensor have 
the well-known form 


1 f din duj , duk dUk 

2 ' dx° dx° dcdj 


fc— 1,2, 3 (3) 


The Total Lagrangian formulation is based on virtual 
work per unit original volume. The stress measure that 
is energetically conjugate to Green-Lagrange strain is the 
second Piola-Kirchhoff stress tensor, it. That is, the 
strain energy of the deformed structure can now be cal- 
culated using the above strain arid stress measures with 
integration over original volume. The variation in strain 
energy is then simply 

SU J <7 df-ij dV 
v 


( 4 ) 



For non-linear analysis, an incremental, procedure is Fol- 
lowed. The variation, in strain energy in the current state 
is expressed in terms of incremental deformations mea- 
sured from a previous known state. It must be under- 
stood that the variation in strain in the current state is 
simply the variation in incremental strains. That is. 


dtjjit + At) — (iAc, 


(5) 


where A«y is the incremental strain 


Uf(t + At) - 




(6) 


The incremental strain is related to the incremental de- 
formations Arif. They are defined as follows 

Xi(t) ~ X® + Ui(t) 

Xj(t - 4 - At) -- x) - J a . [t -f- At) 

Aw, = Xitt + At) -- x i(t) = Ui(t + At) — 14 (f) 

Substitution of the strain expression 3 in 6 and use of 
«i(i + At) -" Ui{t) + A«i gives 


A 1 
aCii “2 


tiAil, 


+ - 


8x ( ! 

1 dA-uk dAuk 


QAuj duk dAuk : BAuk du - f , 
+ dadj dx ( j 




k= 1.2,3 


(~) 


where the linear and non-linear strains are separately de- 
noted as eij and «y. The variations follow 


SAcij — SAsij + <5A Kij 


(8) 


where 


SA&ij — 


1 ( dS&Ui dSAui duk dSAu-k dSAuk duk 

+ * ‘ r ‘ + 7T7> Tl 


2 \ dx$ dx l J dx? dx® dx { } dx® 

S j — 

1 f dAuf. dSAuit ( Q6Auk i9Afq<.\ 


2 I dx'l dx® ' dx ( - dx® J 






(9) 

Similarly 

decompose 

the stress 




<Ty{t 3 

■ At) — (Tij ( t ) + A<7 ij 


( 10 ) 

Use 8 , 9 

and 10 in 

4 to obtain 
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j 

f Aoy, 5 Ac 

i.j d V — - J dAC-ij 

dV+ 
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V 
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( 11 ) 
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j (Tij (£) S&Kij dV 4- j A&ij SAft-ij 

dV 


J 
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j 
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The integration, as before, is over the original volume. 
This expression is exact for large deformations, large 
strains, and material non-linearities. For linear elastic 


materials, the incremental stress is related to the incre- 
mental linear strain, by a constitutive relation of the form 


A Hi. 


/). , A f 


Dijmn h&s the general form 1/ O' L with D' containing 
the material constitution and L the composite ply angle 
transformation. The first term in 11 then becomes the 
standard linear finite element term. The second term is 


an incremental term involving the existing state of stress 
and linear strains. The third term, underlined, is the 
key term for rotorcraft. it is an incremental term involv- 
ing the existing state of stress and the non-linear strains. 
This term produces the structural couplings between, ax- 
ial and transverse deformations (torsion is not a separate 
state of motion in 3-D but a function of transverse defor- 
mations) in response to inertial effects. It carries within 
it the classical extension- bending and flap- lag structural 
couplings. The fourth term is dropped as part of lin- 
earization, with the assumption Dij m „e mn SKij « 0. The 
final expression then becomes 


SO — ' j ASwn S Ac dV "b 

j 
V 




dV 


( 12 ) 


Iterations are required, of course, primarily for <r^(f) but 
also for the linearization. The latter is usually insignif- 
icant. For example, for a static solution, given a pre- 
scribed, deformation- independant, non-inertia! external 
forcing, the equation of motion takes the form 


su - S W E 


where SWfj is the external virtual work. The iterative 
procedure is then 



v 


readily recognized as a Newton-Raphson iteration. If 
cry(t) is updated only cm the right hand side, the pro- 
cedure is a modified Newton iteration. For a rotor, it, 
must be updated on both sides initially to obtain the 
correct non-linear stiffness. Thereafter, modified New- 
ton is enough for the purposes of airload non-linearities. 


Kinetic energy 

The variation in kinetic energy or the virtual work 
by inertia! forces is given by 

ST =-- -6W, = f p'f- St dV 
v 

where T and Sr are the acceleration and virtual displace- 
ment of a material point P on the blade relative to an 
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inertial frame /. Let P lie in a non-inertia! frame B. 
The frames / and 8 are associated with corresponding 
coordinate axes or basis. At any instant, frame B has a 
displacement x Br relative to the frame / and an orienta- 
tion defined by a rotation matrix C ,a . C lB defines the 
orientation of / relative to B, i.e. it rotates the axis from 
B to I. If the components of x B ‘ expressed in B and f 
basis are denoted by x BI,B and x Bt,t , then 


rotational motions 11,11,(1 with respect to a non-rotating 
inertial frame 1 at the hub. B has no rigid body states 
with respect to /. Thus §x B1,a = 0. 6ip sl ? B — 0. and 

V BI/B = i; Bl/S = Q< 


<V ~H C » s *8» 

a.;,- CyC$ —CfhSs 

0 eg 


(17) 


Bin 


Recall, that the time derivative of the rotation matrix is 
related to the skew symmetric angular velocities by 

C f B — c IB w BI f B — ~q jb Bc w 

where CS IB/B is the angular velocity of B relative to / 
and measured in B, and w B,/> is the angular velocity 
of / relative to B and measured in I. The components 
of the motion of the point P relative to I and B and 
expressed in 1 and B frames satisfy 


ip — fit is the blade azimuth angle, 8 is the control angle, 
and ce — cos 6 .. . etc. 



Qs e i e c (18) 
fie* 


e B and e ! are the basis vectors in the B and I frame 
repectively. The non-zero components of f pl - rl the 
kinetic energy expression 16 become 


r Pip — c lB ix° l!B + r PB ! B \ 

j.PI /1 _ qJB^,BI/B + f-PB/B + 

fPI/I + qJB/B v BI/B jtPB/B _ i_ 

~,BI/B f P 8 /B + ( 5 B//B (S B//B r PB/B + ~Bt/ B r P 8 / B ■ 

( 14 ) 

where the frame motions have been expressed in body- 
fitted coordinates as x ni/1 — v B, ' t » C ,B v B! ' l> and 
x Bl ' 3 = c ;,i v Bl, ' B +C IB (j jB '' B v Br / B . The components 
of virtual displacement are 

Sr* 1 ' 1 = C !B {Sx B! ‘‘ B + S^i/BfPB/B + Sr Pa/ B) 

---- C IB (SX BI ' B - fPBIBfyBlfB + Sr PB/B ) 

= C IB R T Sq 


jtPB/B _ 


2 CjBl/BfPB/B = 2 


0 

fie* 

--Os* 


til 

ih 

m 


— ftc* Os* 
0 -8 

0 0 


(19) 



jBl/B PB/B _ 

-ft 2 kls 0 mice 

-fi 2 c| - $ 2 0 2 c*.9* 

8Q.cq 0 2 cqS(i — 0 2 s| — 0 2 



(20) 


( 21 ) 


(15) 


where 



R‘ 


x BI/B and ’<p B1 ’ 8 are the rigid body translational and 
rotational states of frame B. The virtual displacement 
§ T l- B/B | n u f rame j 8 written in terms of its finite element 
degrees of freedom Sr PB ^ B — t- The kinetic energy 
is then 


' 

f Zl \ 


r. 


K X 3 / 

( 22 ) 


ST j (Sr 


pi n \T -Fiji 


dV 


flfi) 


In general, frame B may contain a general flexible cora- 
ponent. Consider a simple case for illustration. Let B be 
the undeforrned blade rotating frame, containing the en- 
tire blade, with origin at the rotor hub. It undergoes con- 
trol angle motions 0, 8,9 with respect to another rotating 
frame, R, with origin at the hub. This frame undergoes 


0 —fie* + 006** fis* + Ode# 

fie* — {Wsd 0 0 

-0% - SWc ff § 0 


The virtual displacement is simply the variation of 
the incremental displacements 


Sr-PB/B = C IB Su 


Thus, the kinetic energy 16 takes the following form 

{ 


ST — j 5u 4 p (T m ii 4- T c u -I- T'k t 


f}dV 


24) 


For example, with the simplest assumption of only a col- 
lective, 0 = 0, and steady rotation, ft — ■ 0, we have from 
19 - 22 the following mass, damping, stiffness, and force 



contributions from inertia. 
T m = h 


T- = -20 


t c # -$e 

Cf; 0 0 


L »» 


0 0 


T k = -O 2 


0 

o 4 


o 

-s e cg 


0 -SsCg sj ] 
T { =T k [x\x%x%\ T 


(25) 


Virtual work by external forces 


A surface lies, always, on one or more of the element 
faces, defined by its natural coordinates £ + 1, r/ ± 1, or 
( ± 1 {see following section), A differential change d£ 
in the natural coordinates (£,??, 0 creates the following 
changes in the geometric coordinates (xi,xi,x^). 


jV 

dx t = x iA d4 = £ H k4 x f i - i, 2, 3 (26) 

*=i 

Similarly, differential changes dr; and d£ create the vec- 
tors x,qdr) and x 4 ( in the geometric coordinates. The 
area d£di), for example, then corresponds to the area of 
a parallelogram between the two vectors x 4 d£ and x^dn 
in the geometric domain, defined by their cross product 
or cross product matrices: x 4 x x. n — -x^x rj -- x 4 x. n 


stiffening known as element locking as the element thick- 
ness tends to zero. A simple but effective way to prevent 
locking is to use higher-order elements - as in this study 
containing sufficient number of internal nodes. Devising 
efficient lower order locking-free brick elements, based on 
reduced-integration or enhanced assumed strain meth- 
ods, are beyond the scope of this initial development. 
The primary focus at present is on accuracy. 



Figure 1; 27-nodo, isoparametric brick element in 
curvilinear natural coordinates; 64 gauss integra- 
tion points. 


X 4 X X.,j = 


■C'2,tX^ rf ^'2 ,r ;-' f 

*3,^1, n “ * 3 ,n*u 

X\ 4 X2,t) “ XijfX-z.r 


Por example, if normal pressure p on an elemental face 
defined by £ —constant, produces a virtual work p dA ■ Ail, 
the components of dA are simply x 4 x, n d£ dr;. The com- 
ponents of Su are the incremental virtual deformations 
rfftti to «sF = 8q T H T , where H are structural shape 
functions (see next section). The virtual work expres- 
sion is then 


SW E - Sq T j pH 7 x 4 x r . dfdri = 5q r Q (27) 
A 

A general surface stress distribution a$ is incorporated 
in exactly the same manner using (as - dA) • Ail. 

SWe — Sq 7 f H T as x 4 x, n d £ dp — Sq l Q (28) 


Because fluid stress are deformation dependant, the ares 
must be evaluated at the deformed configuration. The 
deformed configuration is not known a priori, therefore, 
iterations are required. This is discussed further is the 
subsection ‘Steady Hover Prototype’ under ‘3-D PEM 
Rotor Analysis’. 


Brick Finite Elements 


Figure 1 shows a quadratic, Lagraitgian, iso- 
parametric brick element developed in this study. It con- 
sists of 8 vertex nodes and 19 internal nodes - 12 edge 
nodes, 6 face nodes, and I volume node. Within isopara- 
metric elements, geometry and displacement solution are 
both interpolated using the same shape functions. Thus 


N 





N 

At* = J2 H « 

a -- 1 


(29) 


and therefore 


N JV 

X i — y: fig Xia. Ui -- y~[ Hg uf (30) 

a— L a=I 

where a is the elemental node point index, N 27 here. 
The shape functions are expressed in element natural, co- 
ordinates 0 v„ and (. We consider Lagrange polynomials 
in each direction. 


H a = - L?(0 LJ(n) l? K (<) (31) 

In the present formulation n — m = p — 2; and / , J, K — 
1,2,3. The second order Lagrange polynomials in, say 
x, are x(x — l)/2, 1 — z 2 , and x(x + l)/2. For example, 
based on the focal node ordering shown in the figure, we 
have the shape function corresponding to node 11 as 


The analysis of bending dominated problems involv- 
ing thin structures using 3-D elements suffer from severe 
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The strains require the derivatives of the shape functions 
with respect to geometric coordinates 
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The required derivatives are then 
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Henceforth the above derivatives are denoted as 
H a ,t,H sl _ 2 ,H a .[ i and ll a ^.H a . n ,H a ^, In addition, the 
first quantity in 32 is denoted by Lj, i.e. 
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These are calculated front the derivatives with respect to 
natural coordinates as follows 


The evaluation of the Jacobian, J, is straight forward 
using the derivatives of the shape functions with respect 
to element natural coordinate axes and the location of 
the nodal points (i.e. the grid points). 


Prom the linear strain Aey as defined in 7, the strain- 
displacement relation now takes the following form 

Ae = (JJjc,o + Bn'jAq (38) 

where 

Ae J — A(en £22 £33 2eia 2«as 2eia} 

Aq r — [«U U2 1 MSI «12 «22 «S2 ■ ■ ■ Ml V Wav Msv] 

(37) 

and the expressions for Bio and Bli are given above. 
The first term in the strain energy 12 now becomes 

■SAq r ( j (Bta + B u f D (B !M + B lA ) dV J Aq 

(38) 

The second term is treated is the same manner to obtain 
<)Aq T ^ j (Blo + B u f a(t) dV j Aq (39) 

where ff(t) = (<rn (722 &:a vn crarj ctvsf - Now consider 
the non-iinear incremental strain Afsy as defined in 7. 
Clearly, the strain is non- linear and a strain-displacement 
relationship cannot be found. However it has a quadratic 
form, find hence can be re-arranged as 


SAf | f J S n. 


Inl (t) b nl dV j Aq 
with the matrices having special forms 
Bit) ■ 


(40) 


" rr(f) 0 

0 

f « 

0 

0 ] 

0 <r(*) 

0 

: 0 = ! 0 

li 

0 

0 0 

tr(t) 

L° 

0 

0 j 


a 

B Nt = 

' e,v/. 

0 

0 

B.v/. 

0 

0 

; 0 = 

r ::i 



0 

0 

Bnl . 


J 



. Thin plate 



0 0 


Figure 2: A cantilevered 

plate using (4,4,4) brick el- 
ement grid 



Solid cube 


> 



Figure 3; Verification of brick elements; Natural fre- 
quencies of a (4,4,4) solid block approaching Kirchoff 
plate frequencies (symbols) with reduced thickness 
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The volume integrations are performed using 4 gauss 
points along each natural coordinate axes, a total of 64 
integration points. Note that 

dV = det(J) dt, dr\ 


Mode 

number 

4x4x4 

Bricks 

8x8x4 

Bricks 

Kirchhoff 

Plate 

i 

3.55 

3.50 

3.47 

2 

8.68 

8.53 

8.51 

3 

22.93 

21.58 

21.29 

4 

28.16 

27.29 

27.19 

5 

32.84 

31.19 

30.96 

6 

56.78 

54.31 

54.13 

7 

71.19 

63.09 

61.29 

8 

74.75 

64.96 

64.16 

9 

83.68 

72.50 

70.98 


Table 1: Plate frequenci es usin g 3D FEM: nondi- 
mensionalized w.r.f . y D/pla 4 , m plate dimen- 
sion, t: thickness, pi density, D = Et s /V2(i - v 2 ), 
Ei Young’s Modulus and u: Poisson’s ratio 


VERIFICATION of 3-D FEM 

A preliminary verification of the 3-D FEM model is 
carried out by reproducing non-rotating thin plate fre- 
quencies, and rotating slender beam frequencies. The 
former verifies the locking-free behavior. The later veri- 
fies the non-linear implementation. 


section grid 

Torsion 1 

Torsion 2 

1x1 

1.710 

5.132 

2x2 

1.586 

4.758 

3x3 

1.577 

4.733 

4x4 

1.576 

4.728 

5x5 

1 .575 

4.726 


Table 2: Beam torsion frequencies vs. cross- 

section grid refinement; 8 span wise elements; 
nondimensionalized w.r.t. y TTJJTl?', EB val- 

ues are 1.571 and 4.712; Beam dimensions are 

L x c x e, L — 100c 


1 Thickness 

8 

16 

20 

i 

3x3 

3 x 3 

3x3 

I c 

4.733 

4.726 

4.725 

| c/2 

4,757 

4.746 

4.742 

| c/4 

4.824 

4.794 

4.784 

1 c/8 

4.886 

4.839 

4.824 


Table 3: Second torsion frequency vs. spanwise 
grid refinement; 3 x 3 cross-section; nondimen- 
sionalized w.r.t. , JGJ/IL 2 ; EB values are 1.571 
and 4.712 

Thin Plate Frequencies 

The locking-free behavior of the brick elements, in 
shear, is verified by re-producing Kirchhoff plate frequen- 
cies for a thin plate, using a (4,4,4) brick mesh Fig, 3. 
The plate frequencies are obtained from converged 2D 
rectangular plate finite elements (20,20) and are vali- 
dated easily with documented classical solutions. The 
discrepancy at; the higher inodes are resolved using a finer 
mesh converged solution, see Table 1. The residual dif- 
ferences are due to shear, not present in the Kirchhoff 
solution, but present in the brick solution. 
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(b) Mode 4: L&g I 



(c) Mode 5: Flap 3 (d) Mode 6: Torsioii 1 



{e} Fan plot showing rotating frequencies 



Figure 4: Rotating frequencies and mode shapes of an uniform hingeiess blade of aspect ratio 15, 

thickness 25% chord, and rectangular cross-section; 3-D bricks vs 1-D beam; 16 x 3 x 3 grid 


Aspect 

Ratio 

Torsion 1 

Torsion 2 

100 

1.598 

4.794 

40 

1.599 

1.799 

20 

1.603 

4.813 

15 

1.606 

4.826 

10 

1.615 

4.860 

8 

1.622 

4.890 

6 

1.635 

4.919 

5 

1.645 

4.991 

4 

1.662 

5.064 

3 

1.794 

5.478 


Table 4: Torsion frequencies vs. aspect ratio; 16 x 
3x3 grid; nondimensionalized w.r.t. yjGJ/IL 2 : 
RB values remain 1.571 and 4.712 for all aspect ratios. 

Slender Beam Frequencies 

Next, the 3D element is verified using a slender of 
uniform geometry that behaves as a beam over a reason- 
ably targe variation in thickness and aspect ratio. The 
bending frequencies are easy to re-produce, torsion in 
general requires greater resolution. An uniform slender 
beam of aspect ratio 100 and square cross- section, i.e., 
the dimensions are 100c, c, and c, in length, width, and 
thickness, is considered. The torsion frequencies converge 
towards Euler-Bemoulli numbers with two to three cross- 
section elements 2. The remaining difference stems from 
spanwise resolution. 

The effect of spanwise resolution is shown in the first 
row of Table 3, starting from 8x3x3 grid as the base- 
line. Spanwise resolution becomes more important as 


Mode | 
no. 

Beam freqs. 
(/rev) 

3-D freqs, 
(/rev) 

Mode 

type 

i 

0.826 

0.824 

Lag I 

2 

1.059 

1.058 

Flap 1 

3 

2.768 

2.769 

Flap 2 

4 

5.058 

5.006 

Lag 2 

5 

5.21.1 

5.223 

Flap 3 

6 

| 6.433 

6.625 

Torsion 1. 

7 

!• 8.542 

8.597 

Flap 4 


Table 5: Rotating frequencies for a soft- inplane 
hingeiess rotor; 16 x 3 x 3 grid in 3-D; L: Lag. F: 
Flap, T: Torsion 

the beam thickness sa reduced. This is shown in the sub- 
sequent rows and columns of Table 3. The rows show 
the variation of torsion frequency with a progressively 
thinner beam. The columns show the effect of spanwise 
refinement for each thickness There is increased devia- 
tion from Euler- Bernoulli values as the thickness reduces. 
With the thickness fixed at c/4 and the grid at 1 6 x 3 x 3. 
the aspect ratio of the beam is now progressively reduced. 
Table 4 shows that from 100 to 20 the frequencies remain 
relatively constant. At aspect ratio 5 there is still only 
an error of 5 — 6%. This deveation is expected from the 
physics of the problem and is not an error in the FF.M 
formulation. 

Consider the configuration with length 20c, width 
e, and thickness c/4. The rotating modes of this simple 
beam structure are shown in Fig. 4. The frequency plot, 
Fig. 4(e), shows the the beam frequencies are almost ex- 
actly reproduced by 3D FEM - and the small grid size 
of 16 x 3 x 3 is adequate for this simple problem. Note 
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that this serves as & verification of the non-linear for- 
mulation.. The torsion frequency shows an error of 5% 
consistent with the deviation front Euler-ReniouUi fre- 
quencies in the non-rotating case for this level of grid 
refinement. 'Bor this structure, the torsion frequency Is 
relatively high, and occurs only as the sixth mode. The 
rotating frequencies at 0 = 27 rad/s are given in Tkble 5, 
both for a beam and the 3-D analysis. 

ITERATIVE SUBSTRUCTURING USING 
PARALLEL KRYLOV SOLVER 

A parallel Newton- Krylov solver is developed to pro- 
vide an efficient and scalable 3-D PEM solution. 

Large-scale structural dynamics problems are solved 
most efficiently using the method of substructures. Sub- 
structuring involves partitioning a structure into non- 
overlapping subdomains. It is the most accurate method, 
because each subdomain can have its own internal solver 
depending on its local condition number. Almost always, 
a direct, factorization is preferred. A real structure con- 
tains significant heterogeneities - thin and slender geome- 
tries, plate and shells by biharmonic. PDEs, 3-D bricks 
with high bandwidths (> 1000), non-linear materials, 
and constraint forces - factors that routinely give rise 
to condition numbers of 10 s - lO 10 . 

Modern methods of iterative substructuring pro- 
vides a domain-decomposition based preconditioned it- 
erative solver for the interface problem. The interface 
problem need not be of primal type, but can also be of 
dual type. A primal problem consists of variables that 
are a subset of the original unknowns, e.g. the displace- 
ments at the interface. A dual problem consists of vari- 
ables that are not a subset of the original unknowns but 
whose equality must still be gauranteed, e.g. the reaction 
forces at the interface. Depending on the problem dual 
variables differ - bending problems will involve transverse 
shears and moments, whereas a plane stress or strain 
problem will onvolve only in-plane stresses. Regardless 
of type, whether primal or dual, all finite element inter- 
face descriptions are precisely the discrete equivalents of 
the Poincarh-Steklov operator. 

The interface has attractive spectral properties as a 
result of which it is more amenable to iterative solve. In 
particular, unlike the substructures themselves, the con- 
dition number of the interface problem grows at a rate 
that is an order slower compared to the original prob- 
lem, e.g., by 0(/r ! ) for second order and by 0(ft -3 ) for 
fourth order PDEs. However, it also grows, necessarily, 
at where H is the subdomain size. For a second- 

order, elliptic, positive definite, and coercive operators, 
the precise number for uniform finite element meshing 
given by [25] 


where H is the maximum and H m is the minimum sub- 
domain diameter. The main objective of iterative sub- 
structuring is to provide preconditioned such that the 


preconditioned interface problem has a condition num- 
ber independent of both h and H. Such a preconditioner 
is an optimal preconditioner. At the same time, it must 
be constructed in parallel, subdomain by subdomain, re- 
quiring only communication between subdomains but no 
other serial operation - otherwise the primary purpose 
of Iterative substructuring is defeated. 

The dependant* on 0 (I/~ 2 ) cannot be prevented 
without a coarse grid solver — communication only be- 
tween neighboring subdomains will always show this de- 
pendence. Thus, a coarse problem i.e. a general mecha- 
nisms to propagate local information globally, is a central 
requirement of any scalable solver. 

The general building blocks of a preconditioned 
Krylov solver ( for solving M~ l Ax = M~ l b are: (I) 
residual calculation r = h — Ax, (2) preconditioning 
M~ l r and, (3) a matrix-vector multiplication Ax. An 
iterative substructuring algorithm provides these build- 
ing blocks in a subdomain independant, parallel manner. 
Once the building blocks are provided, constructing a 
Krylov solver is trivial. Unlike the CG update, the GM- 
RES update, however, poses its own parallelization issues 
due to the Arnold! procedure. 

The FETI-DP Algorithm 

In iterative substructuring, the subdomain interface 
nodes are first separated into vertex, edge and face nodes. 
The vertex nodes and a subset of edge nodes art; then 
designated as corner nodes. In FETI-DP, the degrees 
of freedom (DOFs) associated with the corner nodes are 
formulated as a primal interface. The DOF's associated 
with the rest are formulated as a dual interface. The 
corner nodes form a coarse problem that propagates lo- 
cal subdomain information globally. Because the number 
of DOFs associated with a corner depend on the order of 
the problem, i.e. 3 DOFs for 2nd order brick FEM or 6 
DOE's for 4th order plate or shell elements, it automati- 
cally renders the coarse mash appropriately denser with 
increase in order. The FETI-DP method and its imple- 
mentation in this study is entirely based on the work of 
Refs. [15, 16]. A detailed exposition of our implementa- 
tion. is not provided here. A brief description is provided 
below summarizing its key aspects. 

For a given subdomain, if its nodes are re-ordered 
with internal nodes I s first, followed by interface nodes 
F|, and then comer nodes JPj~ (for a selection of corner 
nodes in 3-D, see next section), then a subdomain matrix, 
gay the stiffness matrix, takes the following form 


K* - 

[ Kh 
K%, 

Kfs 

K%e 

k ;,v 1 

il 

K s 
n RV 


L Ky r 

K've 

J 

| l “VR 

ft vv J 


( 41 ) 

where the internal and edge nodes are denoted together 
as R: s nodes. The subdomain forcing, /*, and unknowns, 
t I s , are correspondingly 
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where 



Two Boolean restrictions are defined for each subdomain. 
The first Boolean restriction, B R , restricts u R to w|.. and 
assigns a +1 or —I sign such that equality of the interface 
degrees of freedom are gauranteed upon convergence. 

E B R U R = 0 


The second part of the residual r% is obtained after the 
coarse solve 

f 2 . — I‘JtV tty = ^ ffj.T Xll KftyBytly (49) 

S 

The residual is then r = r t + r 2 . Note that the resid- 
ual calculation is based on subdomain Neumann solves. 
Therefore, the subdomain partitioning, and corner node 
selection must ensure null kernels. 


The summation sign denotes assembly over subdomains. 
The second Boolean restriction., Bf : , restricts the global 
corner nodes to subdomain corners. Note that, for a re- 
ordered subdomain, the first restriction, B% has the form 
and size 


B'r = 




1 

n 

i 


(44) 


Preconditioner 


The residuals are used to construct subdomain fluxes 
rf = K%, w* + K% b B%r% (50) 

with %u* obtained using subdomain Dirichlet solves 

Kf,w* = Kf r ~ l Kf E B%r% (51 ) 

from which the preconditioned residual follows 


where B% is a diagonal matrix with entries +1 or — 1. 
The dual-primal procedure computes a set of dual vari- 
ables (auxiliary variables that are not, part of the original 
problem} which on convergence allows the recovery of the 
subdomain internal and edge nodes as 

KWr = /r- B% r A* (45) 


M"' l r - V B 


s „S 


(52) 


Expanding the above expressions, we have, formally 


m- 1 -E s i 


0 s% B j ij « 


and the global corner nodes as 

K\,y u% = Fl v X + r v (46) 

The corner problem, which propagates error globally, is 
a coarse grid problem that is also constructed subdomain 
by subdomain. Formally, 


where S% E are the subdomain Schur complement matri- 
ces. A more efficient preconditioner (but not optimal) is 
obtained by skipping the Dirichlet solve above and cal- 
culating the fluxes directly as 

r/ s — K ee B%r% 


__ D5 T 1 CSS ISS T Tf 9 — ) rfS | OS 

K vv - / r ts v [ A vv ~ H rv k rr h m-'j 

35 

Fb A = V B{ r [~K% v r K% M 1 B% t ] X s (47) 

ft 

tv = E B v r [Kttv T K%R- 1 f% - fv] 

S 

The solve, however, is carried out, in every subdomain. 
Thus, before the interface iterations begin, the subdo- 
main contributions to the left hand side of the corner 
problem are constructed and factorized in every subdo- 
main, and globally communicated. Thereafter, during 
the interface Krylov iterations, the coarse problem solve 
is only a repeated right hand side solve 

The building blocks of the Krylov iteration: residual, 
calculation, preconditioning, and matrix-vector multipli- 
cation procedure are briefly stated below. 


This leads formally to 


AT 1 = 




» 


0 0 
o K*ee 


os T 

Mr 


(54) 


The subdomain Schur complement matrices have been 
approximated here by their leading terms. The two pre- 
conditioners above are termed the Dirichlet and Lumped 
preconditioners. All results shown later in this paper use 
the Dirichlet preconditioner, even though for 3-D brick 
problems the Lumped preconditioners are known to be 
faster. 


Matrix- vector multiplication 

This is identical to the residue calculation, except 
/jj — Bf j . 1 X s is now replaced with B Jj p s . Here p* is the 
subdomain restriction of the vector to the multiplied. 


Residue calculation 

The first part of the residual su is obviously 

n = E b r< 

& 

— E tk ~ 

if s 




B%K% r ~ 


' X s 


(18} 


Numerical scalability 

For a symmetric and coercive elliptic operator the 
condition number of the preconditioned FETI-DP inter- 
face problem can be shown to grow as 
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Figure 5: A 4 x 4 plate partitioning; 16 el- 
ements in each partition. Interface, cor- 
ner, and boundary nodes shown in red, 
blue, and green respectively. 



Figure 6: A 16 x 16 piate partitioning; 16 

elements in each partition. 

, That is, if the subdomains have size II , and the finite 
element mesh has size k, then the condition number of 
the interface grows only as II jh. The condition num- 
ber determines the iteration count required for conver- 
gence. For optimal scalability algorithm, the iteration 
count does not grow with the number of subdomains as 
long as the. mesh within each subdomain is refined to 
keep H/h constant. Thus, a bigger problem with addi- 
tional subdomains require the same iteration count as a 
smaller problem as long as both contain the same mesh 
resolution . 

The optimality of the algorithm is verified on a piate 
bending problem. Plate bending, like beam bending, 
is governed by 4-th order partial differential equations 
and is considered a challenge for iterative solvers because 
their condition numbers grow at a rate Q(h~ 4 ). Tables 6 
and 7 show the change in iteration count with decreas- 
ing values of h and H respectively. The iteration count 
increases with increase in H/h and decreases with de- 
crease io H/h. However, if H/h is held fixed. Table 8 
shows, as desired, the iteration count remains relatively 


constant. Thus, the two piate problems shown in Figs. 5 
and 6 converge at the same rate - even though the latter 
is four times as big as the former. 


th 

n s 

tefc/ 

FETI-DP 

CG 

FETI-DPl 
GMRES j 

1/8 

16 

216 

18 

21 1 

1/12 

16 

468 

23 

26 

1/16 

16 

816 

26 

31 | 

1/20 

16 

1260 

29 

36 

1/24 

IS 

1800 

32 

39 

1/32 

16 

3168 

37 

46 


Table 6: Number of substructures fixed n s =16 (4 x 
4 mesh partition). Iteration count vs. increase in 
problem size. 


H 

n x 

Ti-dof 

FETI-DP 

CG 

FETI-DP 

GMRES 

h 1/3 

9 

1260 

31 

46 

1/4 

16 

1260 

32 

41 

1/6 

36 

1260 

29 

33 

1/8 

64 

1260 

25 

29 

1/12 

144 

1260 

21 

23 


Tabie 7 : Problem size fixed DOFs=120O (h=l/24 
mesh). Iteration count vs. incrase in number of 
substructures. 


h 

n. 

H'dof 

FETI-DP 

CG 

FETI-DP 

GMRES 

1/12 

9 

468 

23 

29 

1/16 

16 

816 

26 

31 

1/20 

25 

1260 

28 

32 

1 /24 

36 

1800 

29 

33 

1/28 

49 

2436 

30 

34 

1/32 

64 

3168 

30 

34 


Table 8: Problem size per substructure fixed 

H/h~ 4 (16 elements per substructure). Iteration 
count vs. increase in problem size. 

Parallel Implementation of CG 

A standard Conjugate Gradient (PCG) update is as 
shown below. The main building blocks that are con- 
structed using the parallel FETI-DP procedure are high- 
lighted in bold. 


Ao 

= 0: 

ro = 

O 

| 

T3 



for 

k = I 

, 2 ,.. 





Zjfc-s 

= M 

1 r k _i 




Cfc = 

(~ T 

l ‘■k- 

Tte-i j / Ui 

L 2 ffc-2j 

with = 0 


Pk =' 

Zk- 1 

+ skPk-l 

J/Pkj 

with pi z. 


7k = 

/ T 

\4- 

\ l / f" 



X k — 

Afc-i 

+ 7 kPk 




'f'k - 

ffe-i 

— JkF pk 








end 

In addition to the communication requirements for 
the FBTI-DP, the CG update requires processor synchro- 
nization points of its own. These axe points beyond which 
calculations cannot proceed unless all processors reach 
that: point. All vector inner products are synchroniza- 
tion points. The two synchronization points are under- 
lined above. An additional synchronization point is re- 
quired to calculate the norm of the preconditioned resid- 
ual !!«*_, )j 2 , to determine the stopping criteria. In the 
case of CG, the total number of points can be reduced to 
one, using advanced norm estimation techniques [26, 27], 
This refinement has not been included at present, but it 
is desired when thousands of distributed memory nodes 
are eventually used. 

Parallel Implementation of GMRES 

A GMRES update require an Arnold! procedure and 
a solution of a least-square problem. A Reorthogonal- 
ized Classical Gram-Schiindt Arnold! procedure is imple- 
mented in this study, based on the seminal work of Daniel 
et al. [28} . On the one hand, this algorithm produces high 
levels of orthogonalization (down to machine precision) 
and is superior to Modified Gram-Schmidt [29]. Ort the 
other hand, it remedies the unacceptable communication 
costs of the latter (explained below). Note that, a Clas- 
sical Gram-Schimdt, (i.e., without Reorthogonalization) 
Is numerically unstable and is not used in practice. 

Recall, given an initial estimate Xa and residual 
r o - h ~ Ax o, every m-th GMRES iterate for the so- 
lution of Ax — b is given by x m = zo + K where I< lies 
in the Krylov subspace of dimension m associated with 
A and r 0 , K m (A,r 0 ) — aparifro, Ar (h , . . , A m ~ y ro}. and 
minimizes the norm [16 — Ax ||j. 

The Arnold! procedure in dimension m constructs 
an orthonormal basis V m = \pi . t> 2 , , . . , i* m J of the Krylov 
subspace K m (A,ra). The procedure also generates a ma- 
trix H m of size (m + 1) x m the top m x m block of which 
is an upper Hessenberg matrix H m . The m-th iterate is 
computed as x m = z< } + V m y m where y m is calculated 
such Xm minimizes jjft — Ax m \\a. This amounts to cal- 
culating a y m which minimizes \\0ei — H m y m \ la where 
0 = | [rolja and cj is the first canonical vector of 
A QR factorization — employing Givens rotations is 
used here to solve this least squares problem. 

The classical GMRES method expands the Krylov 
subspace dimension, iteration after iteration, to n and 
terminates in at most n iterations. Each iteration re- 
quires every previous basis vector. A restarted version 
of GMRES, on the other hand, restricts the expansion 
to, say, m dimensions and restarts the Arnold! proce- 
dure using x m as it now initial guess. These restarts are 
called the outer iterations. Given below is the restarted 
GMRES(m) algorithm. 

A(> = Q; Fo — d - i' A 0 


2b = M Vo 
6 — H^olb 

for k — 1,2,... till convergence 
ci = Zk-i/@ 

Arnoldi procedure 
Least-square solve of order m: 

Calculate y m to minimize 
min - H m y j| 2 . 

Use QR factorization of ff,,, . 

Afc = + V m Um 

r k = d - FAfc 
z k = M -1 r k 
0 - \\Zk\\2 

end 

The Arnold! procedure is the heart of the algorithm. 
It requires three steps. For a given A and an initial vector 
. the rn basis vectors are constructed as: 

for j = 1, 2, . . . , m 

(1) Basis expansion: uu+i = Avj 

(2) Orthogonalization: orthogonalize 
with respect to all previous Arnolds 
vectors (iq , wj . . . . , Vj) 

(3) Normalization: hj+ij — ||«tty+j ||2 
and v j+1 = w j+ i/h j+lJ . 

Orthogonalization is the main step. Traditionally a 
Modified Gram-Schmidt algorithm is always preferred at, 
this step (over Classical Gram-Schmidt) because of its 
numerical stability. It, is as follows. The synchronization 
points are underlined. 

for j — I , . . . , m 
w — Fvj 
t ~ M~ l w 
for * = 1, . . . ,j 
hi.j = V I & 

t = t- hijVi 

end 

hj+ij = jWh 

??j4-l — t / H . . 

end 

The first set of points, i.e. the «/ 1 calculations, 
presents a high communication requirement. Within 
each step j. the vector t, once generated, is immediately 
projected to, and subtracted from, each and every one of 
the previous Arnolds vectors tv. Each projection, a vec- 
tor inner product, requires a global synchronization. In 
a Classical Gram-Schimdt, the projection and the sub- 
struction steps could be carried out separately, with a sin- 
gle synchronization step in-between. However, because 
Classical Gram-Schimdt is unstable (though mathemat- 
ically equivalent to Modified Gram-Schimdt), a second 
orthogonalization step is needed. Thus, the final Re- 
orthogonalized Classical Gram-Schmidt, algorithm is as 
follows. 
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w = Fv-j 
t = M'Hv 

for i — 1 ,j 
hij = vft 
end 

Global synchronization 1 
for i = L , - . ,j 

t = t — hijVj 

end 

for i — l ,.,. ,j 

K.j = #£ 
end 

Global synchronization 2 
for i = 1 , ,j 

l = t ~ h> i.j V S 

end 


end 


h — h + h’ 
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Figure 8: Cross-section of prototype rotor 

blade; 5% t/c, exaggerated scale; 4 x 4 bricks 
with internal nodes. 

In addition, for research purposes, we will consider only 
small size problems, with, which a large number of cases 
could be constructed using the 48 processors that were 
at our disposal. 

On the other hand every fundamental aspect of the 
physics of the structural dynamics of an isolated rotor 
blade is incorporated. And, the parallel solution proce- 
dure is generic, i.e. it, is independant of the type of air- 
loads, control angle variation, grid, and material consti- 
tution. The objective is to study the numerical scalability 
of the Newton- Krylov solver and the practical scalability 
of its present implemention. 


o.i R 0. 1 R 



R = 15 c 


Figure 7: P Uniform of prototype rotor blade; 
c — 0.53 in: two different spanwise grid resolu- 
tions used in this study are shown. 


3-D FEM ROTOR ANALYSIS COMPONENTS 

In this section, the main components of the 3-D ro- 
tor FEM analysis are described. They are the geometry 
and grids, partition arid corner selection, steady hover 
prototype, and the transient forward flight prototype. 

These are mere prototypes because we do not use 
real airloads, do not have a trim mechanism, and do not 
have at present a true representation of a blade structure. 


Geometry and Grid 

We consider a hingeless rotor blade, discretized as 
shown in Pigs. 7 and 8. The simple grid generator for 
this study requires that the cross-sectional discretization 
remains the same along span and that all sections be 
solid. With these assumptions, it is straight forward to 
accomodate an arbitrary variation of airfoil shape, twist, 
planform, and advanced tips. A key limitation at present 
is that only one continuous structure can be gridded. 
Grid generation, however, is not the focus of this work. 
It is assumed at the suitable grid will be available to the 
solver from other sources. 

The surface geometry, required for external farcing, 
is defined by the sectional airfoil coordinates. We use a 
generic, symmetric airfoil with 5% thickness (Fig. 8). 

We consider a set of four finite element discretiza- 
tions. Ht x r (2 x ns refers to numbers of elements along 
span, chord, and thickness. Note that each element con- 
tains 81 degrees of freedom and 64 integration points, 


Grid 

Til x «2 x n s 

Total DOFs' 

i 

48 x 4 x 2 

12,960 | 

2 

48 x 4 x 4 

25,920 

3 

96 x 4 x 2 

25,920 

4 

96 X 4 x 4 

46,656 


Table 9; 3D FEM Rotor Grids 

Each finite element, naturally, can accomodate its 
own constitutive material model and ply direction - we 
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use simple isotropic properties: E — 73 GPa; ft — 0.3; 
and p = 2700 kg/nr 5 (corresponding to Aluminum). 

Along with s he dimension c = 0.53m. these gener- 
ate similar order of magnitude non-dimeuaional vain® 
of stiffness and inertia as soft inplane hingeless rotors. 
No attempt is made to place the sectional offsets with 
respect to quarter-chord. Thus, the blade may not be 
dynamically stable. However, the values will generate 
typical deflections with typical airloads. The airloads are 
an uniform 4000 N/m 2 baseline (around 375 Ib/ft along 
span) baseline, and two and four times that amount, to 
generate large deformation cases. The airloads imposed 
on the blade are prescribed they act only on the top sur- 
face and they are normal pressure airloads. Thus, they 
have the non-linear characteristics of a follower force 
Le. they act normal to the deformed surface which is 
not know a priori. The rotational speed is fl = 27 rad/s 
(steady). 

Grid Partitioning and Corner Selection 

The partitioning requirements are unique - not just 
any partitioner will do. The generation of subdomain 
grids from a global grid via a re-calculation of the finite 
element connectivities is straight-forward. Partitioned 
are widely available in public domain, that carry out 
this task intelligently ensuring an optimal balancing of 
processor loads. However, for structures, this is not the 
most important requirement. The most important re- 
quirement is that the the coarse problem be picked to 
ensure a null kernel in every substructure. 



Figure 10: Substructure node designation for 
one-way partition. 

The partitioner we develop as part of this research 
is simple in that it handles only the brick elements we 
developed, and it makes the same assumptions on the 
grid type as our simple grid generator. Namely, that the 
cross-sectional grids must remain same throughout the 
span, regardless of variations in geometry. 

Figure 9(a) shows a type of generic 2-D partition 
that is used in. the present study. The blade can be di- 
vided into any number of substructures in the spanwise 


Boundary nodes 



Figure 11: Substructure node designation for 
two-way partition. Corner nodes are edge 
nodes connecting more than two substruc- 
tures and those occuring at the boundaries. 


Boundary nodes 



Figure 12: Optimal substructure node designa- 
tion for two-way partition. Corner nodes are 
vertex nodes including those occuring at the 
boundaries. 

and chordwise directions. Figure 9(c) shows an alter- 
native L-D partition. It will be shown that the latter, 
though naturally load balanced for a blade structure, is 
a poor partition and should not be used. 

The partitioner performs the following tasks: 

1. Designates the corner nodes. 

2. Reorders the subdomain nodes into interior, face, 
edge, vertex, and boundary nodes. Recalculates the 
subdomain finite element connectivities. 

3. Sets up domain connectivity maps for substructure 
to substructure communication. 

The first, and second are the key tasks. The third is 
merely a matter of book-keeping. 
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(a) 2-D partitioned blade grid 


(b) Corner and interface nodes 



(ej 1-D partitioned blade grid 


(d) Corner and interface nodes 


Figure 9: Partitioning of blade grid into substructures or subdomains. The corner nodes must ensure 
a null kernel for each substructure. Each substructure will be solved in a separate processor. 
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Figure 13; Convergence of Newton- 
Ftaphson outer iterations for structural 
and forcing non-linearities in hover 


Corner Selection 

For a generic 3D substructure, each interface node 
can be a face, edge, or vertex node. Of these, the edge 
and vertex nodes, that are common to more than two 
substructures are designated as corner nodes. Now con- 
sider the 2-way partition of Pig. 9(a). The nodes on the 
subdomain edges are immediately designated as corner 
nodes. It is clear, however, that this definition makes 
the two substructures at the tip end (or extremelies of 
the tip end in case of more than two chordwise strips) in- 
definite. Each substructure then carries a rotational rigid 
body mode. Thus, the definition of corner nodes most 
include in addition, those edge nodes, that are common 
to only two subdomains, but which occur at the bound- 
aries of the structure. With thus definition, the corner 
nodes are now as shown in Fig. 9(b). This definition 
also enables the selection of corner nodes for the 1-D 
partition in Fig. 9(c), otherwise, there would be no cor- 
ner nodes. For a very large-scale 3D problem, a large 
number of corner nodes is generated by this procedure, 
leading to a, moderately large coarse problem. A superior 
choice of corner nodes for a 3-D problem is simply the 
subdomain vertices, and like before, additionally those 
that occur at the boundaries. There is then always a 
maximum of only 8 corner nodes per subdomain - re- 
gardless of the grid. In this paper, this selection is not 
implemented. We use the previous selection as shown in 
Fig. 0{b). Note, the corner nodes define super elements 
that constitute the coarse problem, and, for a 2-D par- 
tition the optimal selection can leave the super elements 
without internal nodes. The concern for element locking 
under these circumstances require a closer examination 
that has not been carried out yet. 

Node Reorder 

The reordering brings the, interior nodes first, fol- 
lowed by interface nodes, then corner nodes, and lastly 


Root 



Figure 14: Blade steady deflection in 

hover using prescribed pressure airloads 
(only grid outlines are shown) 


the boundary nodes. The procedure depends on the grid, 
partition (1-D or 2-D), and the selection of comer nodes. 
The N elemental nodes in each brick is associated with 
a natural within each substructure. The natural order is 
then associated with a reordered order, and its reverse, 
with an association back to natural. In addition, The 
natural order is associated with the global order as the 
geometry' and material constitution is defined in the lat- 
ter. 


Domain Connectivity 

Domain Connectivity is merely a matter of book- 
keeping. For a Lagrangian problem, the connectivity re- 
mains static, and needs to be calculated only once for 
a grid. Because of the non-floating, non-overlapping, 
and conforming nature of the partitions and elements, 
there are no search, interpolation, or projection require- 
ments, Consequently, there are no errors introduced dur- 
ing substructure to substructure communication. Bach 
substructure carries a destination map and a reception 
map. The destination map contains the substructures to 
which quantities are to be dispatched, and the internal 
node numbers to which they correspond. The reception 
maps contains the substructures from which quantities 
are to be received, and the internal node numbers to 
which they will correspond. 
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Figure 15: Convergence pattern of FETI- 
DP /Preconditioned Conjugate Gradient 
updates for simulations on 4 to 48 pro- 
cessors. 

Steady Hover Prototype 

The steady hover (ideal) prototype simply solves 
for blade response using a prescribed pressure airload 
at a constant collective angle. The non-linear solution 
procedure uses Newton-Raphson outer iterations around 
FETI-DP inner solves. The FETI-DP inner solves use a 
Conjugate Gradient (CG) update in hover. This is ade- 
quate as the stiffness matrix is symmetric. 

Several initial updates of the stiffness matrix are 
necessary to include the non-linear structural stiffness -- 
which provides the key extension-bending non-linearity 
associated with rotation. Figure 13 shows the conver- 
gence of this non-linearity in the initial part of the plot. 
Around 8 to 10 iterations are required for a tight con- 
vergence. Subsequently, the rotor stiffness matrices can 
be updated only at certain intervals if desired, while us- 
ing modified Newton in between.. Once the structural 
non-linearities are converged, the pressure airloads are 
imposed on the blade. The convergence of the airload it- 
erations are shown in the same plot in the right hand side. 
The two parts are separated as conceptually these are 
fluid-structure iterations where the geometric stiffness 
need not be updated. The results shown, however, are 
with fully updated stiffnesses in every iteration. The di- 
rection of the pressure airloads is deformation-dependant 
and unknown a-priori. Thus, equilibrium iterations are 
required. 

In each iteration, the virtual work is calculated based 
on the previous deformation state. An alternative, and 
snore rigorous, approach is to linearize the forcing us- 
ing incremental displacements. This leads to a non- 
syinmetric stiffness contribution, which is 'however eas- 
ily handled by replacing the geometric stiffness ./ with 
1/2(J + J T ) as the rote of the stiffness is only to converge 
the Newton iterations. However, iterations are still nec- 



Figure 16: Convergence of GMRES up- 
dates for various restart numbers; all cal- 
culations use 32 processors. 

essary and therefore we believe the previous configuration 
approach is more efficient. The relatively large deforma- 
tion corresponding to airloads 3 is shown in Fig. 14. 

Within each Newton iteration is the Krylov solver, 
FETI-DP with CG updates. The convergence criteria is 
set tightly to 10” 12 for all cases in this study. Fig. 15 
shows the convergence of the solver when run on 4 to 48 
processors. We will always show the convergence corre- 
sponding to the first Newton iteration im it. begins with 
a zero guess and takes the most number of iterations. 
The scalability (fixed problem size) of these calculations 
are examined in more detail in the next section. Here, 
we note that the number of iterations increase with an 
increase in the number of processors, but, only gradu- 
ally the feature of an algorithm where the precon- 
ditioned interface problem grows slowly, and only at a 
poly logarithmic rate with increase In the number of sub- 
domains. Physically, it means that the increasing coarse 
problem allows a greater transfer of substructure infor- 
mation across the global structure. 

Transient Forward Flight Prototype 

The transient forward flight prototype solves for 
blade response using the same set of prescribed pressure 
airloads as in hover, but now the stiffness and forcing val- 
ues are those that arise out of a single time step in a time- 
marching procedure. We consider a Newmark scheme 
with a 5" azimuth step. The control angle variation is 
taken as 0(%p) = 20 0 + 5° cos# - 5" sin#. The dynamic 
stiffness now contains the complete inertial terms and 
non-symetric due the gyroscopic damping. The FETI- 
DP inner solves now use a Generalized Minimum Resid- 
ua! (GMRES) update. 

The matrix structure will be identical in every time 
step, therefore, for purposes of scalability study it is 
enough to consider only one. Issues related to paralleliza- 
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tion arise immediately due to the presence of the Arnold! 
algorithm within GMRES. This issue, its resolution, and 
its impact on scalability if any, are described and stud- 
ied in detail in the next section. Unlike CG where each 
update require only two previous updates for construc- 
tion, in GMR.ES, each update require information from 
every previous updates. For example, as shown earlier 
in Fig. 15 if 100 iterations are required, 100 such vec- 
tors have to be stored- A standard procedure is to use 
a restarted version where only m previous updates axe 
used at a time. Once a set of m updates are obtained an 
estimate of the solution is constructed. The m updates 
are then thrown away and the construction of a new set 
begins with the current estimate of the solution. 

Figure 16 shows that for the small grid size used 
here, the convergence pattern is similar over a broad 
range of restart parameter - even m = 5 is adequate 
to obtain convergence. For purposes of a realistic scala- 
bility study, however, we will consider m = 30, 40, and 
50. These are deemed adequate for large-scale structural 
configurations with dense interface matrices. 

SCALABILITY OF 3-D ROTOR ANALYSIS 

The scalability of the 3-D FEM parallel solver is re- 
ported here in detail. The first section is relevant to the 
steady hover prototype. The Krylov solver uses a CG 
update here. The idea of substructure optimality and 
definition of scalability is introduced here. The second 
section is relevant to the transient forward flight pro- 
totype. The Krylov solver is equipped with a GMRES 
update here. 

Steady Hover 


n s 

FE I Subdom 
I LU 

Coarse 

FETI-DP 

/PCG 

Solver 

total 

6 

201 

757 

130 

775 

1668 

8 

200 

463 

91 

622 

1180 

! ~rr J 

199 

246 

63 

458 

770 

16 

1.94 

165 

52 

361 

580 

24 

191 

96 

51 

267 

416 

32 

1.90 

66 

71 

213 

350 

48 

190 

39 

168 

187 

395 


Table 10: Solver time vs. number of substructures 
on a single processor 

Consider the grid of size 96 x 4 x 2. It, is partitioned 
into 6 to 48 substructures using a 2-way partitioning, 
i.e. substructure arrangement similar to Fig. 9(a) hav- 
ing 3 X 2 to 24 x 2 blocks. The solver time for each of 
these problems are shown in Fig. 17. The importance 
of safest met u ring is immediately apparent. There is a 
steep drop in solution time with increasing number of 
substructures. For any problem of a fixed size, a condi- 
tion of diminishing return, must eventually be reached, 
with an optimal number of substructure producing the 


n s 

“FE 

Subdom 

LU 

Coarse 

FETI-DP 

/PCG 

Solver 

total 

"TP 

32.6 

121.80 

21.56 

94.07 

"237.87 

8 

24.2 

57.44 

12.33 

57.23 

127,28 

12 

16.2 

20.9-5 

5.80 

26.85 


16 

12.6 

Hj'.m' 

3.64 

14.93 

29.19 

24 

8.18 

4.16 

2.71 

7.96 

14.89 

32 

6.01 

2.20 

2.93 

5.70 

10.85 

48 

4.24 

0.88 

5.62 

5.15 

1 11.70 


Table II: Solver time vs. number of processors; 
each processor contains one substructure 


minimum solution time for that problem size. We shall 
call this the substructure optimality number. For this 
problem it is 32. Note however that the rise In solution 
time beyond the optimality point is not nearly as steep 
as its decline prior to it, and there is a large region over 
which it remains flat. For this problem, this region is 
between 16 to 48 substructures. This flat region is a gift 
of iterative subetructuring. It is shown later that this re- 
gion is sensitive to the partitioning and comer selection 
procedure. A good petitioner and corner selector will 
keep this region flat,, a poor one will produce a steep rise. 
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Figure 17: Solver time vs. number of 
substructures for calculations on a sin- 
gle processor. 


A parallel implementation solves each substructure 
on a separate processor. Note that, it, is important to 
calculate the speed-up obtained, using the single proces- 
sor time with the same number of substructures. That 
is, the solver time with, say 32 processors, must be com- 
pared with the corresponding solver time on a single pro- 
cessor that uses 32 substructures. This is to ensure that 
computations of the same complexity are compared, ot h- 
erwise, the speed-up is contaminated with the benefits of 
substructuring and a super-linear number is always ob- 
tained. This Is because using 32 substructures on a sin- 
gle processor by itself reduces the solver time by more 
than 32 - a fact that has nothing to do with parallelize- 
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Figure 18: Parallel speed-up for calcula- 
tions on multiple processors: each sub- 
structure solved in a separate processor. 


50 r 


40 


CG 

CL 


S 30 ^ 


° 2-D partition! 
$ 1-D partition! 


101 


-O 


Linear speed-up iirre^ <• ' 

\ 


10 


20 30 

No. of processors 


40 


50 


Figure 20: Effect of grid partitioning and 
corner selection on parallel speed-up for 
calculations on multiple processors. 
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Figure 19: Effect of grid partitioning and 
corner selection on solver time as a func- 
tion of number of substructures; calcula- 
tions on a single processor. 


Figure 21: Two different problem sizes 
with the same substructure optimality; 
solver time vs. number of substructures 
on a single processor. 
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lion but substructuring itself. The parallel speed-up is 
shown in Fig. 18, Even for this fixed problem size, it 
has a perfectly linear trend up to the point of substruc- 
ture optimality. Note that the point: corresponding to 32 
processors has nothing to do with the point correspond- 
ing to 16 processors. Indeed, the 32 processor run takes 
only 10.8e whereas the 16 processor ran takes 29.2s, i.e. 
less than half the time. What the speed-up plot shows is 
that 32 processor run the problem exactly 32 times faster 
compared to a single processor using 32 substructures. 
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Figure 22: Parallel speed-up for problem 
sizes with the same substructure opti- 
mality; solver time vs. number of sub- 
structures. 



Figure 23: Two different problem sizes 
with the same substructure optimality; 
solver time vs. number of substructures 
on a single processor. 


The drop off in scalability beyond 32 processors is 
studied using the exact timings for the different parts 
of the computation. The timings for the single proces- 
sor and parallel calculations are given in Tables 10 and 
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Figure 24: Parallel speed-up for problem 
sizes of same substructure optimality. 


11. In the tables, ‘PE’ refers to the time taken to con- 
struct the structural matrices. ‘Solver total’ refers to the 
total solver time. The? two together constitute the to- 
tal simulation time. The ‘Solver total’ time consists of 
three parts: (I) ‘Subdom LU’ time, which refers to the 
subdomain factorization. (2) ‘Coarse’ time, which refers 
to the coarse problem factorization and communication, 
and (3) the ‘FETI-DP/PCG’ time, refers to the Krylov 
solver time. Note that the later includes the computation 
and communication costs of the residual, preconditioned 
and matrix-vector multiplies, and the additional commu- 
nications required for the updates. The communications 
costs are, of course, incurred only during the parallel cal- 
culations. 

From Table 10, which shows the single processor 
timings, the reason behind the flat region in Fig. 17 is 
clear. The growth in the coarse problem is offset by the 
reduction in the Krylov solver time. This is as expected 

- as the purpose of the coarse problem is precisely that 

- but the main point is that the coarse solver should be 
just enough to serve this purpose and no larger, so that 
the substructure optimality is pushed to as high a pro- 
cessor number as possible. Beyond the optimality point, 
any growth in the coarse problem is an indicator of in- 
creased communication, cost for the parallel implemen- 
tation. Note that the coarse problem is solved in every 
processor and as such, requires a global communication. 
The drop off in Fig. 18 is a direct consequence of this 
communication cost. To summarize, a key objective of 
the coarse problem should be to keep the growth beyond 
substructure optimality to as gradual as increase as pos- 
sible. This has no bearing upon scalability with respect 
to problem size, but serves to extend linear speed-up for 
a fixed problem size to as high a processor number as 
possible. 

We illustrate the importance of the coarse prob- 
lem with a worse partitioning. The same problem 
when treated with a 1-way partitioning as illustrated in 
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Figs. 9(c), 9(d), and 10, generates a timing and scala- 
bility piot as shown in Figs. 19 and 20. The 2-wav par - 
titioning results are also plotted for comparison. Clearly, 
from Fig. 19, the same problem now has a substructure 
optimality of 16, as opposed to 32. A good parallel im- 
plementation should gaurantee a linear speed-up at least 
upto 16 processors. Scalability beyond this number is ex- 
pected to be affected adversely by communication costs 
of the coarse problem. This is exactly what is observed 
in Fig. 20. 

The linear speed-up range m not a function of prob- 
lem size but of substructure optimality. For example. 
Figs. 21 and 22, compares the single processor solution 
time and parallel speed-up of a bigger problem of size 
96 x 4 x 4. From the single processor solution time, it 
is clear that the substructure optimality is still 32. As 
a result, the linear speed-up range still extends only up 
to 32. The same conclusions hold for very small problem 
sizes, as shown in Figs. 23 and 24. The smallest grid of 
48 x4x2 still shows a linear speed-up up to 24 processors 
equal to its value of substructure optimality. In the same 
manner as a grid of 4.8 x 4 x 4 which is twice its size. 

In summary, for a given problem size, the present 
solver shows a perfectly linear speed-up - for at least as 
many processors as its substructure optimality. To ex- 
tend this linear speed-up range, a smaller coarse problem 
is required. An example of such a selection was shown 
earlier in Fig. 12. 

The scalability with increasing problem size is il- 
lustrated in Table 12. Problem 2 has twice the size of 
Problem 1. Problem 3 has same size as Problem 2, only 
different grid characteristics. Problem 2 and 3 therefore 
have similar solver times, up to substructure optimality 
which sets in at 24 processors for Problem 3. The timings 
of Problem 1 and Problem 2 provide a simple illustration 
of scalability with increasing size. Because Problem 2 is 
twice the size, twice the number of processors provide an 
approximately same solve time. For example. Problem 
2 on 12 processors take similar time as Problem 1 on 6. 
Problem 2 on 16 take similar time as Problem 1 on 8. 


n v 

Problem 1 
48 x 4 x 2 

Problem 2 
96 x 4 x 2 

Problem 3 
48 x 4 x 4 

“6 

51.52 

237.87 

232.00 

np 

28.39 

.127.28 

126.70 

12 

13.0? 

53.73 

55.43 

16 

8.22 

29.19 

32.72 

24 

5.73 

14.89 

21.69 

32 

5.70 

10.85 

22.28 

48 


11.70 

38.89 


Table 12: Parallel solver times showing scalabil- 
ity with respect to problem size up to limits of 
substructure optimality. 
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Figure 25: Solver time vs. number of sub- 
structures on a single processor: FETI- 
DP/PGMRES. 


Transient Forward Flight 

The same conclusions on effective partitioning and 
substructure optimality are carried over in this section. 
These results, which are very similar to those shown in 
hover, are not repeated. The scalability results of a single 
grid size 96 x 4 x 2 {with substructure optimality of 32) is 
presented here. The results for the other grids compare 
similarly to hover results. 


n p 

Modified GS 

Classical GS 
with Reortb. 

6 

194.08 

190.17 

8 

109.47 

100 47 

12 

41.69 

41.43 

16 

23.92 

24.2 4 

24 

12.33 

11.77 

32 

8.88 

8.61 

48 

10.67 

10.04 


Table 13: Parallel FETI-DP/PGMRES solver 

times (secs.) with Modified Gram-Schimdt and 
Classical Gram-Schimdt with Reorthogonaliza- 
tion based Arnold! procedures. 

The scalability of the parallel FETI-DP/PGMRES 
solver involving the non-symmetric matrices in forward 
flight shows as linear a trend as those of the i’ETl- 
DP/PCC solver in hover. The single processor timings 
are shown in Fig. 25. All calculations use the Reorthogo- 
nalized Classical Gram-Schimdt Arnold: procedure. The 
increasing restart parameters all show the same timings 
on a single processor ~ as expected, because their affect is 
only on memory - but inettrr increasing communcation 
costs for a parallel calculation. However, for the small, 
size problem considered here, there is no discern&bie dif- 
ferences between the three versions even in parallel. As a 
result, the scalability plot shown in Fig. 26 is identical for 
all three cases. Indeed, even the Modified Gram-Schimdt 






Figure 26: Parallel speed-up of FETI- 

DP/PGMRJES solver using Classical 
Gram-Smidt with Reorthogonalization 
based Arnoldi procedure. 

procedure show the same scalability behavior. The lat- 
ter is expected to be drastically inferior for large subdo- 
main problems. Note however, regardless of scalability, 
the actual solution times for Reorthogonalized Classical 
Gram-Schimdt are by themselves lower compare to the 
Modified Gram-Schimdt. This difference is expected to 
be drastic for large scale problems. This trend is dear in 
Table 13, where the former provides a marginally faster 
timing trend. Given the small size of the problem, a 
concrete conclusion is still premature. 

CONCLUDING OBSERVATIONS 

A 3-D FEM analysis for rotary wing dynamics was 
formulated with emphasis on the inertia! terms unique 
to rotorcraft, A dual primal iterative substructuring 
based Krylov solver was developed for a fully parallel 
solution produre. The FETI-DP domain decomposition 
algorithm vfhs used for this purpose. The algorithm was 
equipped with a GMRES update, in addition to its tra- 
ditional CG based implementation, due to the unique 
nan-symmetric nature of the rotary wing inertial terms. 
The scalability of this solver was studied in detail both 
for hover and transient forward flight prototypes. The 
key components of rotorcraft analysis: multibody dy- 
namics, periodic response solution, blade airloads, and 
hence trim were riot part of this study. The focus was 
purely on the scalability of a 3-D FEM based large scale 
structural dynamic analysis. That is the key contribu- 
tion of this paper. The following are the key conclusions 
of this study. 

Key conclusions 

1. Given a fixed problem size, there Is always an op- 
tima! number of substructures or subdomains into 
which the problem can be sub-divided so as to re- 


Figure 27: Parallel speed-up of FETI- 

DP/PGMRES solver using Modified 
Gram-Smidt based Arnoldi procedure. 

quire the minimum solution time. This subdomain 
optimality grows with problem size. Beyond this 
optimality, the benefits of smaller sized subdomains 
are offset by the increasing interface. This has less 
bearing on scalability but is critical for the actual 
solution time. 

2. The present implementation, of the 3D FEM rotor 
code is scalable and shows a linear speed-up. That 
is, an p~ processor calculation with a separate sub- 
structure in each processor takes 1 fp the time com- 
pared to a single processor with ^^substructures. It 
also scales with problem size. That is, a n-t lines 
larger problem takes similar time with n x p proces- 
sors. For a fixed problem size, a drop off in scalabil- 
ity eventually occurs - but not before the subdomain 
optimality number is reached, At that point, there 
is no reason to use any more processors - unless a 
larger problem is attacked in which case, linear 
speed-up is restored again up to the new optimum. 
However, even if more processors than optimum is 
used, the scalability only reaches a plateau, and does 
not show' a dramatic drop off. 

3. This plateau stems from the nature of iterative sub- 
structuring. Unlike classical substructuring, here, 
there is a fiat region in time vs. substructure curve, 
such that the penalty incurred by using more than 
the optimum number of processors is very grad- 
ual. On the other hand, the. difference between 
the plateau and the linear speed-up line stems from 
two practical considerations. First, the subdomain 
to subdomain communication cost, and second, the 
global coarse problem communication cost. 

4. The first penalty is a minor issue that cannot be 
avoided. It can be minimized by by minimizing the 
number of syncroaization points during the Krylov 
updates. This is pertinent more to GMRES where 
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the modified Gram-Scmidt algorithm - the most sta- 
ble procedure for generating the Krylov basis - in- 
currs a significant communication cost. In this paper 
we have compared the modified Gram-Scmidt with 
a cheaper classical Gram-Scmidt. The later requires 
an additional re-orthogonaliziag step that is optional 
and is carried out only when necessary. 

5, The second penalty is a major issue. The size of the 
coarse problem is a key driver of scalability. The 
rule is to select corners which are common to more 
than two subdomains. In 3D this generates a very- 
large number. In addition, depending on the par- 
titioning, there may not be any corner node at all. 
The key idea is that the number of comer nodes can 
vary depending on partitioning but they must be 
selected to ensure nullity of the subdomain kernels. 
Thus, a special, smart partitioner is needed, that 
minimizes the selection of comer nodes while ensur- 
ing the above requirement. Not just any partitioner 
will do. In the present paper this task has been eas- 
ily accomplished manually, due to the simple nature 
of the geometry and grid. 

6. Finally, we note, that the theoretical condition num- 
ber estimates and proof of optimality in domain de- 
composition is based on the assumption of symmet- 
ric ami coercive operators with CG updates. They 
are less developed for non-symmetric systems — and 
are usually based on the assumption of the domi- 
nance of the symmetric operator. One approach is 
to build upon algorithms that are provably optimal 
for the former to extend them to the later. This is 
the procedure followed in this paper. We first ver- 
ify the performance of the algorithms in ideal hover. 
We then extend them to real forward flight condi- 
tions. We equip FET1-DP with GMRES updates 
for this purpose and demonstrated optimal conver- 
gence scalability patterns computationally. We do 
not attempt to provide formal theoretical estimates 
ii! this paper. 

Thrust Areas for Future Research 

A suggested list for future directions of this research 

is given below subdivided into three categories. 

Fundamental Research 

1. Scalable solution for periodic dynamics - temporal 
domain decomposition for boundary' value problems 
in time. 

2. 3-D FEM / multibody dynamics coupled analysis - 
mechanisms and methodology. 

Applied Research 

1. Locking-free elements, hierarchical elements, node- 
less elements to facilitate FEM / multibody dynam- 
ics architechture. 


2. Exact and generic 3-D Fluid-Structure interfaces. 

3. Exact delta coupling procedures for trim solution in 
level and turning flight. 

4. Smart substructuring - efficient comer node selec- 
tion. interface localization, and nodal reordering. 

5. Interfacing with 3-D solid geometry and grid tools, 
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